April 3, 2025
Bayesian Methods treat the standard machine learning problem in a slightly different way
Let \(\boldsymbol \theta\) be the collection of unknowns that we would like to learn from our training data in a way that minimizes generalization error
Let \(\mathcal D\) be the collection of knowns - data, any fixed tuning parameters that are tuned after the fact or just known, etc.
The posterior distribution over the unknowns can be defined using Bayes' theorem:
\[f(\boldsymbol \theta | \mathcal D) = \frac{f(\mathcal D | \boldsymbol \theta)f(\boldsymbol \theta)}{\int \limits_{\boldsymbol \theta} f(\mathcal D | \boldsymbol \theta)f(\boldsymbol \theta) d \boldsymbol \theta}\]
Likelihood:
\[f(\mathcal D | \boldsymbol \theta)\]
Given a value of the unknowns, what is the likelihood with which we would see any random knowns?
\[f(\mathbf y | \mathbf X , \boldsymbol \beta , \sigma^2) \sim \prod \limits_{i = 1}^N \mathcal N(y_i | \mathbf x_i^T \boldsymbol \beta , \sigma^2) = \mathcal N_N(\mathbf y | \mathbf X \boldsymbol \beta, \sigma^2 \mathcal I_N)\]
Prior:
\[f(\boldsymbol \theta)\]
Prior to seeing any data, what are our beliefs/desires for the unknowns?
Chosen for convenience/purpose.
For linear regression (with conjugate prior):
\[f(\boldsymbol \beta) \sim \mathcal N_P(\boldsymbol \beta | \boldsymbol \mu_0 , \boldsymbol \Sigma_0)\]
Conditioning the prior beliefs on the data, we get the posterior distribution:
\[f(\boldsymbol \theta | \mathcal D)\]
A proper PDF over the unknowns
A probability distribution over the unknown quantities
Can define a most likely value, intervals, etc.
Before jumping into regression, let’s look at a simpler example that demonstrates the Bayesian estimation process.
Example: Normal likelihood, known variance
Suppose that we observe \(N\) instances of \(y\) that we believe are i.i.d. draws from a normal distribution with a known value for the variance. \(\mu\) - the mean of the normal - is unknown.
Likelihood:
\[f(\mathbf y | \mu , \sigma^2) = \prod \limits_{i = 1}^N \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left[ - \frac{1}{2 \sigma^2} (y_i - \mu)^2 \right]\]
Our prior distribution over the unknown, \(\mu\), will be a normal distribution:
\[f(\mu) \sim \mathcal N(\mu | \mu_0 , \sigma^2_0)\]
with some arbitrary mean and variance.
We can define our posterior distribution as:
\[f(\mu | \mathbf y) = \frac{\prod \limits_{i = 1}^N \mathcal N(y_i | \mu , \sigma^2)\mathcal N(\mu | \mu_0 , \sigma^2_0)}{\int \limits_{-\infty}^{\infty} \prod \limits_{i = 1}^N \mathcal N(y_i | \mu , \sigma^2)\mathcal N(\mu | \mu_0 , \sigma^2_0) d\mu}\]
This is our posterior distribution!
It’s just not in a useful form
What is the mode of the posterior?
What is the variance of the posterior?
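Before simplifying anything analytically, we can answer these questions numerically by evaluating the unnormalized posterior on a grid (a sketch with toy data; the sample values, prior settings, and grid bounds are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 4.0                                   # known variance
y = rng.normal(3.0, np.sqrt(sigma2), size=50)  # toy observations

mu0, sigma2_0 = 0.0, 10.0                      # prior mean and variance for mu

# Unnormalized log posterior evaluated on a dense grid of candidate mu values
grid = np.linspace(-5.0, 10.0, 20001)
log_lik = -0.5 / sigma2 * ((y[:, None] - grid[None, :]) ** 2).sum(axis=0)
log_prior = -0.5 / sigma2_0 * (grid - mu0) ** 2
log_post = log_lik + log_prior

# Normalize numerically so the density integrates to 1 over the grid
dx = grid[1] - grid[0]
post = np.exp(log_post - log_post.max())
post /= post.sum() * dx

mode = grid[np.argmax(post)]
mean = (grid * post).sum() * dx
var = ((grid - mean) ** 2 * post).sum() * dx
print(mode, mean, var)
```

The grid approach works fine with one unknown, but the cost explodes as the number of unknowns grows - hence the strategies that follow.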
How do we make posteriors useful?
Four strategies:
Analytically find an easy to work with form for the posterior
Find the posterior mode and apply a Laplace approximation
Take a large number of random draws from the posterior (Not covered in this class, see Gibbs and Metropolis approaches)
Find an approximation that maximizes the marginal likelihood
Sometimes, posterior distributions can be analytically simplified
Let’s show an example of this.
\[f(\mu | \mathbf y) = \frac{\prod \limits_{i = 1}^N \mathcal N(y_i | \mu , \sigma^2)\mathcal N(\mu | \mu_0 , \sigma^2_0)}{\int \limits_{-\infty}^{\infty} \prod \limits_{i = 1}^N \mathcal N(y_i | \mu , \sigma^2)\mathcal N(\mu | \mu_0 , \sigma^2_0) d\mu}\]
First simplification:
The denominator is a constant that is not random!
We’re integrating over \(\mu\) and everything else is known!
Its purpose is to ensure that the posterior integrates to 1.
\[f(\mu | \mathbf y) = \frac{1}{Z} \prod \limits_{i = 1}^N \mathcal N(y_i | \mu , \sigma^2)\mathcal N(\mu | \mu_0 , \sigma^2_0)\]
\[f(\mu | \mathbf y) = \frac{1}{Z} \prod \limits_{i = 1}^N \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left[ - \frac{1}{2 \sigma^2} (y_i - \mu)^2 \right] \frac{1}{\sqrt{2 \pi \sigma_0^2}} \exp \left[ - \frac{1}{2 \sigma_0^2} (\mu - \mu_0)^2 \right]\]
Simplification #2:
We only care about the random terms
e.g. terms that have to do with \(\mu\)
Anything else can be moved to the front constant
\[f(\mu | \mathbf y) = C \prod \limits_{i = 1}^N \exp \left[ - \frac{1}{2 \sigma^2} (y_i - \mu)^2 \right] \exp \left[ - \frac{1}{2 \sigma_0^2} (\mu - \mu_0)^2 \right]\]
The product of exponentials is the exponential of the sum of the exponents:
\[f(\mu | \mathbf y) = C \exp \left[ - \frac{1}{2 \sigma^2} \sum \limits_{i = 1}^N (y_i - \mu)^2 \right] \exp \left[ - \frac{1}{2 \sigma_0^2} (\mu - \mu_0)^2 \right]\]
Applying the same step to fold the prior in:
\[f(\mu | \mathbf y) = C \exp \left[ - \frac{1}{2 \sigma^2} \sum \limits_{i = 1}^N (y_i - \mu)^2 - \frac{1}{2 \sigma_0^2} (\mu - \mu_0)^2 \right]\]
Expanding the squares and distributing the sum:
\[f(\mu | \mathbf y) \propto \exp \left[ - \frac{1}{2 \sigma^2} \left(\sum y_i^2 - 2 \mu \sum y_i + N \mu^2 \right) - \frac{1}{2 \sigma_0^2} \left(\mu^2 - 2 \mu \mu_0 + \mu_0^2 \right) \right]\]
Combining like terms:
\[f(\mu | \mathbf y) \propto \exp \bigg[- \frac{1}{2} \bigg(\frac{1}{\sigma^2} \sum y_i^2 + \frac{1}{\sigma^2_0} \mu_0^2 - 2 \mu \left(\frac{1}{\sigma^2}\sum y_i + \frac{1}{\sigma^2_0} \mu_0 \right) + \mu^2 \left(\frac{N}{\sigma^2} + \frac{1}{\sigma^2_0} \right) \bigg) \bigg]\]
Now, take advantage of the fact that we can get rid of any multiplicative terms that don’t have to do with \(\mu\)
\[f(\mu | \mathbf y) = C' \exp \left[ - \frac{1}{2} \left(- 2 \mu \left(\frac{1}{\sigma^2}\sum y_i + \frac{1}{\sigma^2_0} \mu_0 \right) + \mu^2 \left(\frac{N}{\sigma^2} + \frac{1}{\sigma^2_0} \right) \right) \right]\]
This looks a lot like the kernel of a normal distribution:
\[C \exp\left[-\frac{1}{2 s^2}(z - m)^2 \right] = C' \exp\left[-\frac{1}{2 s^2}(z^2 - 2 z m) \right]\]
Can we put this in the form of the Gaussian kernel?
The Gaussian identity for random variable \(y\): \[C' \exp\left[-\frac{1}{2}(ay^2 - 2 b y) \right]\]
is equivalent to a normal distribution (up to a multiplicative constant) where:
\[\hat{\sigma}^2 = a^{-1}\]
\[\hat{\mu} = a^{-1}b\]
This means that our posterior is a normal distribution:
\[f(\mu | \mathbf y) \sim \mathcal N(\mu | \hat{\mu} , \hat{\sigma}^2)\]
where
\[\hat{\sigma}^2 = \left( \frac{N}{\sigma^2} + \frac{1}{\sigma^2_0} \right)^{-1}\]
and
\[\hat{\mu} = \left( \frac{N}{\sigma^2} + \frac{1}{\sigma^2_0} \right)^{-1}\left(\frac{1}{\sigma^2}\sum y_i + \frac{1}{\sigma^2_0} \mu_0 \right)\]
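In code, the conjugate update is just these two formulas (a sketch; the data-generating values and prior settings are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 2.0                                    # known variance
y = rng.normal(5.0, np.sqrt(sigma2), size=100)  # toy observations
N = len(y)

mu0, sigma2_0 = 0.0, 1.0                        # prior mean and variance

# Conjugate normal/normal update: precisions add, means are precision-weighted
post_var = 1.0 / (N / sigma2 + 1.0 / sigma2_0)
post_mean = post_var * (y.sum() / sigma2 + mu0 / sigma2_0)
print(post_mean, post_var)
```

The posterior mean lands between the prior mean (0) and the sample average - exactly the shrinkage behavior discussed next.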
That’s convenient that we ended up with a normal posterior distribution!
It’s not a coincidence
Conjugate Prior
For certain likelihood forms, the posterior takes the same distributional form as the prior
Normal/Normal
Beta/Bernoulli
Multivariate Normal/Multivariate Normal
Prior form is chosen out of convenience!
Interesting observation:
\[\hat{\mu} = \left( \frac{N}{\sigma^2} + \frac{1}{\sigma^2_0} \right)^{-1}\left(\frac{1}{\sigma^2}\sum y_i + \frac{1}{\sigma^2_0} \mu_0 \right)\]
If the prior variance is small (meaning a really precise prior or we have a strong belief about the location of \(\mu\)), the posterior mean will be really close to \(\mu_0\)
The numerator will be dominated by the value of the prior mean
Interesting observation:
What happens if we have a prior with infinite variance?
\[\hat{\sigma}^2 = \left( \frac{N}{\sigma^2} + \frac{1}{\sigma^2_0} \right)^{-1} = \frac{\sigma^2}{N}\]
and
\[\hat{\mu} = \left( \frac{N}{\sigma^2} + \frac{1}{\sigma^2_0} \right)^{-1}\left(\frac{1}{\sigma^2}\sum y_i + \frac{1}{\sigma^2_0} \mu_0 \right) = \frac{\sigma^2}{N} \frac{\sum y_i}{\sigma^2} = \frac{\sum y_i}{N}\]
The moments of the posterior for the mean become the maximum likelihood estimates: the sample average and the squared standard error.
We can think of MLE as a special case of Bayesian methods where we have diffuse priors
Another interesting result:
As \(N \to \infty\) with a non-infinite variance prior
\[\hat{\sigma}^2 \approx \frac{\sigma^2}{N}\]
\[\hat{\mu} \approx \frac{\sum y_i}{N}\]
At a certain point, our prior doesn’t matter!
We can see MLE as the asymptotic limit of the posterior distribution as \(N\) gets large
How is this useful for us?
Let’s now look at the linear regression problem. We want to find the posterior over the regression coefficients given the data we observe and assuming a known variance:
\[f(\boldsymbol \beta | \mathbf X , \mathbf y, \sigma^2)\]
The likelihood:
\[f(\mathbf y | \mathbf X , \boldsymbol \beta , \sigma^2) = \prod \limits_{i = 1}^N \mathcal N(y_i | \mathbf x_i^T \boldsymbol \beta , \sigma^2)\]
Because this is a product of independent normals, we can represent it as an \(N\)-dimensional multivariate normal distribution:
\[f(\mathbf y | \mathbf X , \boldsymbol \beta , \sigma^2) \sim \mathcal N_N(\mathbf y | \mathbf X \boldsymbol \beta , \sigma^2 \mathcal I_N)\]
The prior:
Let’s say that the prior over the \(P\) coefficients follows a multivariate normal distribution with mean \(\boldsymbol \mu_0\) and covariance matrix \(\boldsymbol \Sigma_0\)
\[f(\boldsymbol \beta) \sim \mathcal N_P(\boldsymbol \beta | \boldsymbol \mu_0 , \boldsymbol \Sigma_0)\]
This means that our posterior is:
\[f(\boldsymbol \beta | \mathbf X , \mathbf y, \sigma^2) = \frac{1}{Z} \mathcal N_N(\mathbf y | \mathbf X \boldsymbol \beta , \sigma^2 \mathcal I_N) \mathcal N_P(\boldsymbol \beta | \boldsymbol \mu_0 , \boldsymbol \Sigma_0)\]
What we know: the likelihood is multivariate normal in \(\mathbf y\) and the prior over \(\boldsymbol \beta\) is multivariate normal - a conjugate pairing.
This means that:
\[f(\boldsymbol \beta | \mathbf X , \mathbf y, \sigma^2) \sim \mathcal N_P(\boldsymbol \beta | \hat{\boldsymbol \mu} , \hat{\boldsymbol \Sigma})\]
The proof for this is quite long, but closely follows the normal/normal proof we did above. It is outlined in all its glory in the lecture notes for Bayesian Regression included on the Canvas site.
The punchline:
\[f(\boldsymbol \beta | \mathbf X , \mathbf y, \sigma^2) \sim \mathcal N_P(\boldsymbol \beta | \hat{\boldsymbol \mu} , \hat{\boldsymbol \Sigma})\]
\[\hat{\boldsymbol \Sigma} = \left(\frac{1}{\sigma^2} \mathbf X^T \mathbf X + \boldsymbol \Sigma_0^{-1} \right)^{-1}\]
\[\hat{\boldsymbol \mu} = \hat{\boldsymbol \Sigma}\left(\frac{1}{\sigma^2} \mathbf X^T \mathbf y + \boldsymbol \Sigma^{-1}_0 \boldsymbol \mu_0 \right)\]
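The punchline translates directly into a few lines of linear algebra (a sketch with simulated data; the dimensions, prior settings, and true coefficients are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
N, P = 200, 3
sigma2 = 1.0
X = rng.normal(size=(N, P))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(0.0, np.sqrt(sigma2), size=N)

mu0 = np.zeros(P)
Sigma0 = 10.0 * np.eye(P)        # fairly diffuse prior
Sigma0_inv = np.linalg.inv(Sigma0)

# Conjugate MVN/MVN posterior covariance and mean
Sigma_hat = np.linalg.inv(X.T @ X / sigma2 + Sigma0_inv)
mu_hat = Sigma_hat @ (X.T @ y / sigma2 + Sigma0_inv @ mu0)
print(mu_hat)
```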
Implication 1:
When all of the prior variances are diffuse
\[\hat{\boldsymbol \mu} = \left(\frac{1}{\sigma^2} \mathbf X^T \mathbf X \right)^{-1}\left(\frac{1}{\sigma^2} \mathbf X^T \mathbf y \right) = (\mathbf X^T \mathbf X)^{-1}\mathbf X^T \mathbf y\]
OLS is the solution when we have no prior info!
This also holds when the number of training samples goes to \(\infty\)
Implication 2:
Let’s set \(\boldsymbol \mu_0 = \mathbf 0\) and \(\boldsymbol \Sigma_0 = \tau^2 \mathcal I_P\)
This means that:
\[\hat{\boldsymbol \Sigma} = \left(\frac{1}{\sigma^2} \mathbf X^T \mathbf X + \frac{1}{\tau^2} \mathcal I_P \right)^{-1}\]
\[\hat{\boldsymbol \mu} = \hat{\boldsymbol \Sigma}\left(\frac{1}{\sigma^2} \mathbf X^T \mathbf y \right)\]
Let \(\lambda = \frac{\sigma^2}{\tau^2}\). We can rearrange the posterior mean to be:
\[\hat{\boldsymbol \mu} = (\mathbf X^T \mathbf X + \lambda \mathcal I_P)^{-1} \mathbf X^T \mathbf y\]
which is the analytical solution for Ridge regression
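We can verify this equivalence numerically: the Bayesian posterior mean with a zero-mean, \(\tau^2 \mathcal I_P\) prior matches the Ridge solution with \(\lambda = \sigma^2 / \tau^2\) to machine precision (a sketch on simulated data; all settings are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
N, P = 100, 4
sigma2, tau2 = 1.0, 0.5
X = rng.normal(size=(N, P))
y = X @ rng.normal(size=P) + rng.normal(size=N)

# Bayesian posterior mean with mu_0 = 0 and Sigma_0 = tau2 * I
Sigma_hat = np.linalg.inv(X.T @ X / sigma2 + np.eye(P) / tau2)
mu_hat = Sigma_hat @ (X.T @ y / sigma2)

# Ridge solution with the implied penalty lambda = sigma2 / tau2
lam = sigma2 / tau2
ridge = np.linalg.solve(X.T @ X + lam * np.eye(P), X.T @ y)
print(np.max(np.abs(mu_hat - ridge)))
```

Setting \(\tau^2\) small corresponds to heavy regularization; \(\tau^2 \to \infty\) recovers OLS.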
Regularization with the L2 norm is Bayesian regression with a normal prior on the coefficients centered at 0!
Regularization methods are us setting up desired or believed values for the regression coefficients
The L2 penalty (therefore weight decay) is a Bayesian solution to the generalization problem
The strength of this regularization is then the level at which we want to simplify our results
The Bayesian approach also justifies letting the model be more complex as \(N\) grows
I find this to be a satisfying justification for regularization!
Bayesian methods are regularized by construction
The prior variance serves the same purpose as \(\lambda\) in regularized methods!
Set the prior variance in accordance with how simple we would like our model to be
Choose the prior variance in such a way that we minimize generalization error!
Bayesian machine learning is regularized machine learning
Simplicity reigns supreme
Allow models to grow more complex as \(N\) goes to \(\infty\)
The prior doesn’t matter so much as \(N\) gets large
Aside from the regularization benefits, it also allows us to reasonably estimate coefficients when \(P \approx N\) or \(P > N\).
\[\hat{\boldsymbol \mu} = \left(\frac{1}{\sigma^2} \mathbf X^T \mathbf X + \frac{1}{\tau^2} \mathcal I_P \right)^{-1}\left(\frac{1}{\sigma^2} \mathbf X^T \mathbf y \right)\]
When \(P > N\), \(\mathbf X^T \mathbf X\) is not invertible as the determinant is zero
Adding a positive constant to the diagonal makes the matrix invertible
We can interpret this in terms of the prior:
If \(\mathbf X^T \mathbf y\) is large, let the data tell us the coefficient value
If \(\mathbf X^T \mathbf y\) is smaller, default back to the prior and say that the coefficient is essentially zero
In an underdetermined system, just default back to zero for any excess coefficients.
Bayesian methods aren’t just a theoretical justification for regularization methods.
Gaussian Processes for really curvy functional forms
Hierarchical Models (coefficients are nested due to cluster structure)
Relevance Vector Machines (an SVM variant that allows one to skip cross validation in selecting the cost)
The really common one:
Suppose we have a feature matrix, \(\mathbf X\), that is an \(N \times P\) matrix.
We believe that \(\mathbf X\) can be represented in a lower dimensional space by a latent variable \(\mathbf z \in \mathbb R^{M}\) where \(M \ll P\).
PCA:
\[f(\mathbf X) \approx f(\mathbf Z) \text{ ; } \mathbf x_i = \boldsymbol \Lambda \mathbf z_i\]
where \(\boldsymbol \Lambda\) is a \(P \times M\) matrix of loadings.
Maybe the mapping is nonlinear
\[f(\mathbf X) \approx f(\mathbf Z) \text{ ; } \mathbf x_i = g(\mathbf z_i)\]
Then the goal is to find
\[g(\mathbf z_i)\]
that closely approximates \(\mathbf x_i\).
This general problem is called an autoencoder
We know that the posterior for Bayesian linear regression is a multivariate normal distribution.
But we often want to present a summary of the different coefficients that is more easily interpreted.
Two common quantities:
The maximum a posteriori (MAP) estimate
A 95% credible interval
The MAP estimate
Given a posterior \(f(\boldsymbol \theta | \mathcal D)\), the MAP estimate is the point over the unknowns associated with the highest density
\[\hat{\boldsymbol \theta} = \underset{\boldsymbol \theta}{\text{argmax }} f(\boldsymbol \theta | \mathcal D)\]
For normal posteriors, this is easy as the high density point is the mean:
\[\hat{\boldsymbol \beta} = \left(\frac{1}{\sigma^2} \mathbf X^T \mathbf X + \boldsymbol \Sigma_0^{-1} \right)^{-1}\left(\frac{1}{\sigma^2} \mathbf X^T \mathbf y + \boldsymbol \Sigma^{-1}_0 \boldsymbol \mu_0 \right)\]
We also want to summarize the uncertainty surrounding our MAP estimates
This can always be determined by our posterior variance, \(\hat{\boldsymbol \Sigma}\)
Instead of the posterior variance, we might want to present intervals around the MAP estimates of a specific size
Bayesian 95% Credible Intervals
A region of the posterior that includes 95% of the posterior mass. Conditioned on the data and our prior, there is a 95% chance that the unknowns lie in this region.
For multivariate normals, the marginal posterior and corresponding marginal 95% HPD intervals are easy to compute.
Marginals:
The marginal distribution for each element of a random vector that follows a multivariate normal distribution follows a normal distribution with mean \(\hat{\boldsymbol \mu}_j\) and variance \(\hat{\boldsymbol \Sigma}_{j,j}\)
Marginal 95% HPD:
Since the normal distribution is symmetric around the mean, the 95% HPD is then:
\[\hat{\boldsymbol \mu}_j \pm 1.96 \sqrt{\hat{\boldsymbol \Sigma}_{j,j}}\]
Letting \(\boldsymbol \Sigma_0\) be diffuse, this is equivalent to your standard ol’ confidence interval
All these years, you could’ve been interpreting confidence intervals in the way that you want…
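A sketch of computing marginal MAP estimates and 95% credible intervals from the posterior mean and covariance (simulated data; the prior precision and true coefficients are assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
N, P = 150, 3
sigma2 = 1.0
X = rng.normal(size=(N, P))
y = X @ np.array([0.8, 0.0, -1.2]) + rng.normal(size=N)

Sigma0_inv = np.eye(P) / 25.0    # zero-mean, fairly diffuse prior
Sigma_hat = np.linalg.inv(X.T @ X / sigma2 + Sigma0_inv)
mu_hat = Sigma_hat @ (X.T @ y / sigma2)

# MAP estimate = posterior mean; marginal 95% interval uses the
# square root of the marginal posterior variance
sd = np.sqrt(np.diag(Sigma_hat))
lower, upper = mu_hat - 1.96 * sd, mu_hat + 1.96 * sd
for j in range(P):
    print(f"beta_{j}: {mu_hat[j]:.3f} [{lower[j]:.3f}, {upper[j]:.3f}]")
```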
Sometimes we can find the posterior via analytical methods
What if our prior/likelihood doesn’t admit this kind of simplification?
The Laplace distribution:
\[f(x | \mu , b) = \frac{1}{2b} \exp \left[- \frac{|x - \mu|}{b} \right]\]
The Laplace distribution is sharper at the peak than the normal distribution.
Maybe we want to define a regression model with independent Laplace priors on the coefficients!
Let’s make the prior over our regression coefficients independent Laplace priors:
\[f(\boldsymbol \beta) = \prod \limits_{j = 1}^P \frac{1}{2b_{j0}} \exp \left[- \frac{|\beta_j - \mu_{j0}|}{b_{j0}} \right]\]
Then the posterior using normal errors with constant variance becomes:
\[ f(\boldsymbol \beta | \mathbf X , \mathbf y , \sigma^2) = \frac{\prod \limits_{i = 1}^N \mathcal N(y_i | \mathbf x_i^T \boldsymbol \beta , \sigma^2)\prod \limits_{j = 1}^P \frac{1}{2b_{j0}} \exp \left[- \frac{|\beta_j - \mu_{j0}|}{b_{j0}} \right]}{\int \limits_{\boldsymbol \beta} \prod \limits_{i = 1}^N \mathcal N(y_i | \mathbf x_i^T \boldsymbol \beta , \sigma^2)\prod \limits_{j = 1}^P \frac{1}{2b_{j0}} \exp \left[- \frac{|\beta_j - \mu_{j0}|}{b_{j0}} \right] d \boldsymbol \beta} \]
Our posterior isn’t going to simplify!
The analytical approach:
\[\text{Posterior} \rightarrow \text{MAP} \rightarrow \text{Uncertainty}\]
The MAP approach:
\[\text{MAP} \rightarrow \text{Uncertainty} \rightarrow \text{Posterior}\]
Find the posterior summary quantities and then approximate the posterior.
This will work quite well in a lot of scenarios!
Will also be viable and quick when other approaches won’t.
Specifically, find the multivariate normal distribution over the unknowns that most closely approximates the posterior.
A result that I won’t prove - Bernstein-Von Mises Theorem.
BVM states:
Under some regularity conditions on the likelihood and prior, if the log of the unnormalized posterior is continuous and twice differentiable, then the posterior converges in distribution to a multivariate normal distribution with mean:
\[\hat{\boldsymbol \mu} = \underset{\boldsymbol \Theta}{\text{argmax }}\log f(\boldsymbol \Theta | \mathbf X)\]
and covariance matrix:
\[\hat{\boldsymbol \Sigma} = -\left[ \frac{\partial^2 \log f(\boldsymbol \Theta | \mathbf X)}{\partial \boldsymbol \Theta \, \partial \boldsymbol \Theta^T} \bigg|_{\boldsymbol \Theta = \hat{\boldsymbol \mu}} \right]^{-1} \]
as \(N \to \infty\).
Additionally, this approximation is the multivariate normal approximation to the true posterior distribution that minimizes the Kullback-Leibler divergence between the true and approximate MVN.
Let \(P = \mathcal N(\boldsymbol \Theta | \hat{\boldsymbol \mu}, \hat{\boldsymbol \Sigma})\) and let \(Q = f(\boldsymbol \Theta | \mathcal D)\). Define the KL divergence as:
\[D_{KL}(P || Q) = \int \limits_\mathbf x P(\mathbf x) \log \frac{P(\mathbf x)}{Q(\mathbf x)} d\mathbf x\]
When \(P(\mathbf x) = Q(\mathbf x)\) for all \(\mathbf x \in \mathbb R^P\), this “distance” evaluates to zero.
It grows as the difference between \(P(\mathbf x)\) and \(Q(\mathbf x)\) grows.
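For two univariate normals the KL divergence has a closed form, which we can sanity-check against the defining integral (a sketch; the particular means and variances are arbitrary):

```python
import numpy as np

def normal_pdf(x, m, s2):
    return np.exp(-0.5 * (x - m) ** 2 / s2) / np.sqrt(2 * np.pi * s2)

m_p, s2_p = 0.0, 1.0   # distribution P
m_q, s2_q = 1.0, 2.0   # distribution Q

# Closed-form KL(P || Q) for univariate normals
kl_closed = 0.5 * (np.log(s2_q / s2_p) + (s2_p + (m_p - m_q) ** 2) / s2_q - 1.0)

# Numeric check of the defining integral on a wide grid
x = np.linspace(-10.0, 10.0, 200001)
p, q = normal_pdf(x, m_p, s2_p), normal_pdf(x, m_q, s2_q)
kl_numeric = np.sum(p * np.log(p / q)) * (x[1] - x[0])
print(kl_closed, kl_numeric)
```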
The MAP approximation is the best we can do with a MVN regardless of sample size!
Let’s do a MAP approximation for the MVN/MVN linear regression case.
Our (unnormalized) posterior is:
\[f(\boldsymbol \beta | \mathbf X , \mathbf y, \sigma^2) \propto \mathcal N_N(\mathbf y | \mathbf X \boldsymbol \beta , \sigma^2 \mathcal I_N) \mathcal N_P(\boldsymbol \beta | \boldsymbol \mu_{0} , \boldsymbol \Sigma_0)\]
With the formulas substituted in and terms that don’t involve \(\boldsymbol \beta\) dropped:
\[ \begin{split} f(\boldsymbol \beta | \mathbf X , \mathbf y , \sigma^2) \propto & \exp \bigg[-\frac{1}{2} (\mathbf y - \mathbf X \boldsymbol \beta)^T \frac{1}{\sigma^2} \mathcal I_N (\mathbf y - \mathbf X \boldsymbol \beta) \bigg] \\ & \exp \bigg[-\frac{1}{2} (\boldsymbol \beta - \boldsymbol \mu_0)^T \boldsymbol \Sigma_0^{-1} (\boldsymbol \beta - \boldsymbol \mu_0) \bigg] \end{split} \]
To make this more applicable, let’s set \(\boldsymbol \mu_0 = \mathbf 0\) and \(\boldsymbol \Sigma_0 = \tau^2 \mathcal I_P\).
\[ \begin{split} f(\boldsymbol \beta | \mathbf X , \mathbf y , \sigma^2) \propto & \exp \bigg[-\frac{1}{2 \sigma^2} (\mathbf y - \mathbf X \boldsymbol \beta)^T (\mathbf y - \mathbf X \boldsymbol \beta) -\frac{1}{2\tau^2} \boldsymbol \beta^T \boldsymbol \beta \bigg] \end{split} \]
First step: take the log
\[ \begin{split} \log f(\boldsymbol \beta | \mathbf X , \mathbf y , \sigma^2) \propto & -\frac{1}{2 \sigma^2} (\mathbf y - \mathbf X \boldsymbol \beta)^T (\mathbf y - \mathbf X \boldsymbol \beta) -\frac{1}{2\tau^2} \boldsymbol \beta^T \boldsymbol \beta \end{split} \]
To make things even easier, let’s let \(\lambda = \frac{\sigma^2}{\tau^2}\):
\[\log f(\boldsymbol \beta | \mathbf X , \mathbf y , \sigma^2) \propto -\frac{1}{2} \left[ (\mathbf y - \mathbf X \boldsymbol \beta)^T (\mathbf y - \mathbf X \boldsymbol \beta) + \lambda \boldsymbol \beta^T \boldsymbol \beta \right] \]
Second step: take the derivative w.r.t. \(\boldsymbol \beta\):
\[ \frac{d}{d \boldsymbol \beta} \log f(\boldsymbol \beta | \mathbf X , \mathbf y , \sigma^2) \propto \frac{d}{d \boldsymbol \beta} -\frac{1}{2} \left[ \mathbf y^T \mathbf y - 2 \boldsymbol \beta^T \mathbf X^T \mathbf y + \boldsymbol \beta^T \left[\mathbf X^T \mathbf X + \lambda \mathcal I_P \right] \boldsymbol \beta \right] \]
\[ \frac{d}{d \boldsymbol \beta} \log f(\boldsymbol \beta | \mathbf X , \mathbf y , \sigma^2) \propto -\frac{1}{2} \left[ - 2 \mathbf X^T \mathbf y + 2 \left[\mathbf X^T \mathbf X + \lambda \mathcal I_P \right] \boldsymbol \beta \right] \]
Setting to zero and solving for \(\boldsymbol \beta\):
\[\hat{\boldsymbol \beta} = (\mathbf X^T \mathbf X + \lambda \mathcal I_P)^{-1} \mathbf X^T \mathbf y\]
which matches our posterior mean!
The second part is to get the covariance matrix for the MVN approximation:
\[\hat{\boldsymbol \Sigma} = -\left[ \frac{\partial^2 \log f(\boldsymbol \Theta | \mathbf X)}{\partial \boldsymbol \Theta \, \partial \boldsymbol \Theta^T} \bigg|_{\boldsymbol \Theta = \hat{\boldsymbol \mu}} \right]^{-1} \]
This requires finding the Hessian and evaluating it at the MAP estimate
This covariance estimation method is called the Laplace approximation
It is also (roughly) equivalent to the Fisher information method approach for MLE
This approach is often referred to as Bayesian MLE
\[ \frac{d}{d \boldsymbol \beta} \log f(\boldsymbol \beta | \mathbf X , \mathbf y , \sigma^2) \propto -\frac{1}{2} \left[ - 2 \mathbf X^T \mathbf y + 2 \left[\mathbf X^T \mathbf X + \lambda \mathcal I_P \right] \boldsymbol \beta \right] \]
\[ \frac{d^2}{d \boldsymbol \beta \, d \boldsymbol \beta^T} \log f(\boldsymbol \beta | \mathbf X , \mathbf y , \sigma^2) \propto -\left[\mathbf X^T \mathbf X + \lambda \mathcal I_P \right] \]
And the negative inverse is:
\[\hat{\boldsymbol \Sigma} = \left[ \mathbf X^T \mathbf X + \lambda \mathcal I_P \right]^{-1}\]
matching the posterior covariance (up to the factor of \(\sigma^2\) we absorbed into the proportionality when simplifying)!
The MAP approximation plus Laplace approximation yields the exact same posterior when the true posterior is MVN!
The quality of this approximation is going to depend on the MVN-ness of the actual posterior distribution.
When \(N\) is large and the number of parameters isn’t too large, BVM states that posteriors become MVN!
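Here is a minimal one-dimensional sketch of the MAP-plus-Laplace recipe applied to a deliberately non-normal density (an unnormalized Gamma kernel). The Newton iteration and finite-difference derivatives are implementation choices for the sketch, not part of the theory:

```python
import numpy as np

# Log of an unnormalized, deliberately non-normal density:
# a Gamma kernel with shape a = 5 and rate 1 (mode at a - 1 = 4)
a = 5.0
def log_post(x):
    return (a - 1.0) * np.log(x) - x

# Newton's method for the mode, with finite-difference derivatives
h = 1e-5
x = 1.0
for _ in range(50):
    g = (log_post(x + h) - log_post(x - h)) / (2 * h)
    H = (log_post(x + h) - 2 * log_post(x) + log_post(x - h)) / h ** 2
    x -= g / H

mode = x
H = (log_post(mode + h) - 2 * log_post(mode) + log_post(mode - h)) / h ** 2
var = -1.0 / H   # Laplace approximation: negative inverse Hessian at the mode
print(mode, var)
```

The Laplace variance here is 4, while the true Gamma(5, 1) variance is 5 - a reasonable but imperfect approximation, exactly the MVN-ness caveat above.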
Now, let’s go back to regression with Laplace priors:
\[ f(\boldsymbol \beta | \mathbf X , \mathbf y , \sigma^2) = \frac{\prod \limits_{i = 1}^N \mathcal N(y_i | \mathbf x_i^T \boldsymbol \beta , \sigma^2)\prod \limits_{j = 1}^P \frac{1}{2b_{j0}} \exp \left[- \frac{|\beta_j - \mu_{j0}|}{b_{j0}} \right]}{\int \limits_{\boldsymbol \beta} \prod \limits_{i = 1}^N \mathcal N(y_i | \mathbf x_i^T \boldsymbol \beta , \sigma^2)\prod \limits_{j = 1}^P \frac{1}{2b_{j0}} \exp \left[- \frac{|\beta_j - \mu_{j0}|}{b_{j0}} \right] d \boldsymbol \beta} \]
Let’s look at the unnormalized posterior for a simple linear regression with intercept zero and true coefficient of 0 with no prior, a normal prior, and a Laplace prior.
The Laplace prior places a little extra mass at 0 here.
Do you think that the Laplace approximation will work well here?
Regardless, let’s find the MAP estimate for this regression.
We’re going to make a simplifying assumption here: \(\mu_{j0} = 0\) and \(b_{j0} = b_0\) for all coefficients except for the intercept. For the intercept, we instead place a normal prior with mean 0 and variance \(\sigma^2_{a}\). Denote the intercept as \(\alpha\). Then our posterior becomes:
\[ f(\boldsymbol \beta, \alpha | \mathbf X , \mathbf y , \sigma^2) = \frac{\prod \limits_{i = 1}^N \mathcal N(y_i | \mathbf x_i^T \boldsymbol \beta + \alpha, \sigma^2)\prod \limits_{j = 1}^P \frac{1}{2b_{0}} \exp \left[- \frac{|\beta_j|}{b_{0}} \right] \mathcal N(\alpha | 0 , \sigma^2_a)}{\int \limits_{\boldsymbol \beta, \alpha} \prod \limits_{i = 1}^N \mathcal N(y_i | \mathbf x_i^T \boldsymbol \beta + \alpha, \sigma^2)\prod \limits_{j = 1}^P \frac{1}{2b_{0}} \exp \left[- \frac{|\beta_j|}{b_{0}} \right] \mathcal N(\alpha | 0 , \sigma^2_a) d \boldsymbol \beta d \alpha }\]
Looking only at the numerator:
\[ \begin{split} f(\boldsymbol \beta, \alpha | \mathbf X , \mathbf y , \sigma^2) \propto \prod \limits_{i = 1}^N \mathcal N(y_i | \mathbf x_i^T \boldsymbol \beta + \alpha, \sigma^2)\prod \limits_{j = 1}^P \frac{1}{2b_{0}} \exp \left[- \frac{|\beta_j|}{b_{0}} \right] \mathcal N(\alpha | 0 , \sigma^2_a) \end{split} \]
The log posterior is then:
\[ \begin{split} \log f(\boldsymbol \beta, \alpha | \mathbf X , \mathbf y , \sigma^2) \propto & \sum \limits_{i = 1}^N -\frac{1}{2} \log(2 \pi \sigma^2) - \frac{1}{2 \sigma^2} (y_i - \mathbf x_i^T \boldsymbol \beta - \alpha)^2 + \\ & \sum \limits_{j = 1}^P - \log(2 b_0) - \frac{|\beta_j|}{b_0} \\ & -\frac{1}{2} \log(2 \pi \sigma^2_a) - \frac{1}{2 \sigma^2_a} \alpha^2 \end{split} \]
Let’s reduce this into a more workable form by getting rid of any constants and simplifying our sum terms:
\[ \begin{split} \log f(\boldsymbol \beta, \alpha | \mathbf X , \mathbf y , \sigma^2) \propto -\frac{1}{2 \sigma^2} \sum \limits_{i = 1}^N (y_i - \mathbf x_i^T \boldsymbol \beta - \alpha)^2 - \frac{1}{b_0} \sum \limits_{j = 1}^P |\beta_j| - \frac{1}{2 \sigma^2_a} \alpha^2 \end{split} \]
\[ \begin{split} \log f(\boldsymbol \beta, \alpha | \mathbf X , \mathbf y , \sigma^2) \propto & - \bigg[ \frac{1}{2 \sigma^2} \sum \limits_{i = 1}^N (y_i - \mathbf x_i^T \boldsymbol \beta - \alpha)^2 + \\ & \frac{1}{b_0} \sum \limits_{j = 1}^P |\beta_j| + \frac{1}{2 \sigma^2_a} \alpha^2 \bigg] \end{split} \]
Finally, multiplying everything by \(2 \sigma^2\) (which we can do because everything is just proportional) and assuming \(\sigma^2_a\) is large (a weak prior):
\[ \begin{split} \log f(\boldsymbol \beta, \alpha | \mathbf X , \mathbf y , \sigma^2) \propto & - \bigg[ \sum \limits_{i = 1}^N (y_i - \mathbf x_i^T \boldsymbol \beta - \alpha)^2 + \frac{2 \sigma^2}{b_0} \sum \limits_{j = 1}^P |\beta_j| \bigg] \end{split} \]
Recognize this equation?
The interior of this log posterior is a familiar form: the LASSO!
By placing independent Laplace priors on the coefficients, we are able to recover the sum of squared errors plus an L1 penalty term.
The intercept term is treated differently so it is not shrunk towards zero.
Okay. Let’s try to find the MAP estimate.
Disappointment!
There is no closed form solution for the coefficients.
That pesky absolute value ruins everything…
Bummer.
However, we can derive the gradient for these coefficients
\[\nabla_{\alpha} = \frac{1}{N} \sum \limits_{i = 1}^N (y_i - \mathbf x_i^T \boldsymbol \beta - \alpha)\]
or the average residual given \(\boldsymbol \beta\) and \(\alpha\).
\[ \nabla_{\boldsymbol \beta} = - \frac{\lambda}{N} \text{sign}(\boldsymbol \beta) + \frac{1}{N} (\mathbf y - \mathbf X \boldsymbol \beta - \tilde{\boldsymbol \alpha})^T\mathbf X\]
where \(\tilde{\boldsymbol \alpha}\) is an \(N\)-vector with the value of the intercept repeated \(N\) times and \(\lambda = \frac{2\sigma^2}{b_0}\)
Given this gradient, we can use gradient-based optimizers to find a local maximum of the Bayesian LASSO objective function and generate a MAP estimate.
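A sketch of that gradient-based route using proximal gradient descent (ISTA), where a soft-thresholding step handles the non-differentiable absolute value exactly; the toy data, \(\lambda\) value, and step-size bound are assumptions:

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * |.|: shrink toward zero, clip at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_map(X, y, lam, n_iter=5000):
    """MAP under independent Laplace priors via proximal gradient (ISTA).

    Minimizes sum((y - X beta - alpha)^2) + lam * sum(|beta_j|),
    with the intercept alpha left unpenalized.
    """
    N, P = X.shape
    beta, alpha = np.zeros(P), 0.0
    # Conservative step size below 1/L for the smooth (quadratic) part
    step = 1.0 / (4.0 * (np.linalg.norm(X, 2) ** 2 + N))
    for _ in range(n_iter):
        r = y - X @ beta - alpha
        alpha += step * 2.0 * r.sum()        # plain gradient step on alpha
        r = y - X @ beta - alpha
        beta = soft_threshold(beta + step * 2.0 * X.T @ r, step * lam)
    return beta, alpha

rng = np.random.default_rng(5)
N, P = 200, 5
X = rng.normal(size=(N, P))
beta_true = np.array([2.0, 0.0, 0.0, -1.5, 0.0])
y = 1.0 + X @ beta_true + rng.normal(size=N)

beta_hat, alpha_hat = lasso_map(X, y, lam=50.0)
print(beta_hat, alpha_hat)
```

The coefficients with true value zero are thresholded to (essentially) zero, while the nonzero coefficients are shrunk slightly toward zero - the usual LASSO behavior.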
We can also evaluate the Hessian for this fit to get a MAP approximation for the Bayesian LASSO
Not done frequently because normality of the posterior is achieved when \(N\) is quite large and the optimal \(\lambda\) is quite small
In other words, it doesn’t work well except in the case where OLS is essentially equivalent.
MAP approximations provide a tractable way to do Bayesian inference for different models where a conjugate prior form is not available.
Correct in the limit
Close enough in a lot of different scenarios
Basis for a common machine learning technique called variational inference
Recall that the posterior for the unknowns given the knowns and a prior can be written as:
\[ f(\boldsymbol \theta | \mathbf X) = \frac{f(\mathbf X | \boldsymbol \theta)f(\boldsymbol \theta)}{f(\mathbf X)} = \frac{f(\mathbf X | \boldsymbol \theta)f(\boldsymbol \theta)}{\int \limits_{\boldsymbol \theta} f(\mathbf X | \boldsymbol \theta)f(\boldsymbol \theta) d \boldsymbol \theta} \]
The numerator is random with respect to the unknowns
The denominator is a constant since it is computed marginalizing over the unknowns.
e.g. the marginal likelihood of the data
The denominator above is written generically to keep notation light. For regression exercises treating \(\sigma^2\) as known, this denominator is really:
\[f(\mathbf y | \mathbf X , \sigma^2, \boldsymbol \Pi)\]
where \(\boldsymbol \Pi\) is the collection of prior hyperparameters (e.g. the prior mean and covariance).
This quantity tells us the likelihood we would observe the outcomes given any fixed parameters marginalizing over the unknowns.
Let’s think about what this means
Given the model choice and the prior over the unknown parameters, how probable are the observed outcomes, averaged over the parameter values the prior considers viable?
Larger values correspond to good fits to the data.
Smaller values indicate that the data isn’t well explained by the model.
Key point: the marginal likelihood is a Bayesian measure of the generalization error for a model!
If the prior encodes the probability-weighted viable values of the unknowns, then we’re assessing how well the model fits the data w.r.t. all values implied by the prior
The more overfit a model is, the more the “quality” of the fit depends on having just the right coefficients
Little changes in the unknowns lead to large drops in fit!
Can also show that the marginal likelihood is roughly equivalent to leave one out cross validation
\[LOOCV(\mathbf y) = \frac{1}{N} \sum \limits_{i = 1}^N \log f(y_i | \mathbf x_i , \hat{\boldsymbol \Theta}_i(y^{(-i)},\mathbf X^{(-i)}))\]
Log marginal likelihood:
\[\log f(\mathbf y | \mathbf X , \boldsymbol \Pi)\]
We cannot treat each \(y_i\) as an independent draw once the coefficients are integrated out.
In the regression case, we construct the problem such that \(f(y_i | \mathbf x_i , \boldsymbol \beta, \sigma^2)\) is independent of \(f(y_j | \mathbf x_j , \boldsymbol \beta, \sigma^2)\) conditional on the coefficients.
However, marginally, \(f(y_i | \mathbf x_i , \sigma^2)\) is not independent of \(f(y_j | \mathbf x_j , \sigma^2)\), since the outcomes share a common DGP through \(\boldsymbol \beta\).
Always true, though:
\[\log f(\mathbf y | \mathbf X , \boldsymbol \Pi) = \sum \limits_{i = 1}^N \log f(y_i | \mathbf y_{1:i-1}, \mathbf X, \boldsymbol \Pi)\]
The two are almost equivalent!
The difference: LOOCV requires \(N\) fits of the model, while the marginal likelihood is computed once
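This near-equivalence is easy to check numerically in the conjugate case. A minimal sketch (simulated data and illustrative hyperparameters, chosen only for the demo) comparing the one-shot log marginal likelihood against the sum of one-step-ahead posterior predictive log densities:

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn, norm

# Simulated data and illustrative hyperparameters (demo choices only)
rng = np.random.default_rng(4)
N, P = 30, 2
X = rng.normal(size=(N, P))
sigma2 = 1.0
y = X @ np.array([1.0, -1.0]) + rng.normal(size=N)
mu0, Sigma0 = np.zeros(P), np.eye(P)

# One-shot log marginal likelihood for the conjugate MVN/MVN model
log_ml = mvn(mean=X @ mu0, cov=sigma2 * np.eye(N) + X @ Sigma0 @ X.T).logpdf(y)

# Chain-rule decomposition: log f(y) = sum_i log f(y_i | y_{<i}),
# where each term is the one-step-ahead predictive given the first i-1 points
A = np.linalg.inv(Sigma0)   # running posterior precision
b = A @ mu0                 # running precision-weighted mean
seq_total = 0.0
for xi, yi in zip(X, y):
    Si = np.linalg.inv(A)
    mi = Si @ b
    seq_total += norm.logpdf(yi, loc=xi @ mi,
                             scale=np.sqrt(sigma2 + xi @ Si @ xi))
    A += np.outer(xi, xi) / sigma2   # posterior update after seeing (x_i, y_i)
    b += xi * yi / sigma2

assert np.isclose(log_ml, seq_total)
```

The two quantities agree to numerical precision here because, for this model, the sequential predictive densities are exactly the factors of the marginal likelihood.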
This is great! Why is this the first time you’re hearing about it?
It can be quite difficult to compute. Generally, the integral:
\[\int \limits_{\boldsymbol \theta} f(\mathbf y | \mathbf X , \boldsymbol \theta)f(\boldsymbol \theta) d \boldsymbol \theta\]
is intractable and cannot be computed analytically.
However, there are some cases where we can compute this integral!
One is linear regression with a conjugate normal prior.
Sparing the gory details of this derivation (see Chapter 2 of PML2 for these), the marginal likelihood for a regression model with a MVN likelihood and MVN prior is:
\[ \begin{split} \int \limits_{\boldsymbol \beta} & \mathcal N_N(\mathbf y | \mathbf X \boldsymbol \beta , \sigma^2 \mathcal I_N) \mathcal N_P(\boldsymbol \beta | \boldsymbol \mu_0 , \boldsymbol \Sigma_0) d \boldsymbol \beta = \\ & \mathcal N_N(\mathbf y | \mathbf X \boldsymbol \mu_0 , \sigma^2 \mathcal I_N + \mathbf X \boldsymbol \Sigma_0 \mathbf X^T) \end{split} \]
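This closed form can be verified numerically. A minimal sketch with simulated data and illustrative prior hyperparameters; the cross-check simply rearranges Bayes’ theorem, \(f(\mathbf y) = f(\mathbf y | \boldsymbol \beta) f(\boldsymbol \beta) / f(\boldsymbol \beta | \mathbf y)\), which holds at any value of \(\boldsymbol \beta\):

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

# Simulated regression data (illustrative choices throughout)
rng = np.random.default_rng(0)
N, P = 50, 3
X = rng.normal(size=(N, P))
sigma2 = 0.25
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=np.sqrt(sigma2), size=N)
mu0, Sigma0 = np.zeros(P), np.eye(P)   # prior hyperparameters

# Closed-form log marginal likelihood: N_N(y | X mu0, sigma^2 I_N + X Sigma0 X^T)
log_ml = mvn(mean=X @ mu0, cov=sigma2 * np.eye(N) + X @ Sigma0 @ X.T).logpdf(y)

# Cross-check: log f(y) = log f(y|b) + log f(b) - log f(b|y) at any b,
# using the conjugate posterior N(mu_n, Sigma_n)
Sigma_n = np.linalg.inv(np.linalg.inv(Sigma0) + X.T @ X / sigma2)
mu_n = Sigma_n @ (np.linalg.inv(Sigma0) @ mu0 + X.T @ y / sigma2)
b = np.zeros(P)   # evaluate the identity at beta = 0
check = (mvn(mean=X @ b, cov=sigma2 * np.eye(N)).logpdf(y)
         + mvn(mean=mu0, cov=Sigma0).logpdf(b)
         - mvn(mean=mu_n, cov=Sigma_n).logpdf(b))
assert np.isclose(log_ml, check)
```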
An additional case where we can compute the marginal likelihood is when we use the MAP estimate with Laplace approximation to get an approximate posterior.
\[\int \limits_{\boldsymbol \theta} f(\mathbf y | \mathbf X, \boldsymbol \theta)f(\boldsymbol \theta) d\boldsymbol \theta \approx f(\mathbf y | \mathbf X, \hat{\boldsymbol \mu}) f(\hat{\boldsymbol \mu}) (2 \pi)^{P/2} \text{det}\left[\hat{\boldsymbol \Sigma} \right]^{1/2}\]
where \(\hat{\boldsymbol \mu}\) is the MAP estimate and \(\hat{\boldsymbol \Sigma}\) is the inverse Hessian of the negative log joint density at the mode.
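A minimal sanity check of the Laplace approximation on a one-dimensional conjugate example (a normal likelihood with known unit variance and a normal prior, both chosen for illustration), where the log joint is exactly quadratic, so the approximation recovers the exact marginal likelihood:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal as mvn, norm

# Illustrative model: y_i ~ N(theta, 1), prior theta ~ N(0, 10^2)
rng = np.random.default_rng(1)
N = 30
y = rng.normal(loc=2.0, scale=1.0, size=N)

def neg_log_joint(t):
    t = np.asarray(t).item()   # accept scalar or length-1 array from the optimizer
    return -(norm.logpdf(y, loc=t, scale=1.0).sum()
             + norm.logpdf(t, loc=0.0, scale=10.0))

theta_hat = minimize(neg_log_joint, x0=0.0).x.item()   # MAP estimate
H = N + 1 / 10.0**2   # Hessian of the negative log joint (analytic here)

# Laplace approximation: f(y|mu^) f(mu^) (2 pi)^{P/2} det(Sigma^)^{1/2},
# with P = 1 and Sigma^ = H^{-1}
log_ml_laplace = (-neg_log_joint(theta_hat)
                  + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(H))

# Exact marginal likelihood for this conjugate model: y ~ N(0, I_N + 10^2 * 11^T)
exact = mvn(mean=np.zeros(N), cov=np.eye(N) + 100.0 * np.ones((N, N))).logpdf(y)
assert np.isclose(log_ml_laplace, exact, atol=1e-4)
```

For non-Gaussian posteriors the approximation is no longer exact, but the same recipe (optimize, then plug the mode and the inverse Hessian into the formula) applies.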
When can we use this?
Generally, Bayesian methods approximate this high-dimensional integral using Monte Carlo methods (see chapter 3 of my dissertation)
Selecting \(\lambda\) for ridge regression
The ARD prior (outlined in the lecture notes)
Recall that Ridge regression can be expressed as a Bayesian problem:
Multivariate normal likelihood with mean \(\mathbf X \boldsymbol \beta\) and covariance matrix \(\sigma^2 \mathcal I_N\)
Multivariate normal prior with mean \(\mathbf 0\) and covariance matrix \(\tau^2 \mathcal I_P\)
We can always rescale the problem (dividing \(\mathbf y\) and \(\boldsymbol \beta\) by \(\sigma\)) such that the likelihood has covariance \(\mathcal I_N\) and the prior has covariance \(\frac{\tau^2}{\sigma^2} \mathcal I_P = \lambda^{-1} \mathcal I_P\), where \(\lambda = \frac{\sigma^2}{\tau^2}\) is the ridge penalty.
The marginal likelihood given \(\lambda\) is then:
\[f(\mathbf y | \mathbf X, \lambda) = \mathcal N_N(\mathbf y | \mathbf 0 , \mathcal I_N + \lambda^{-1} \mathbf X \mathbf X^T)\]
Goal: find \(\hat{\lambda}\) such that:
\[ \hat{\lambda} = \underset{\lambda}{\text{argmax }} f(\mathbf y | \mathbf X, \lambda) \]
Integrating over the unknowns/learnable parameters, find the value of the nuisance parameter that maximizes the marginal likelihood
Plug this in or jointly learn the corresponding parameter values to fully specify the posterior!
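A minimal sketch of this empirical Bayes selection, using the convention \(\lambda = \sigma^2 / \tau^2\), simulated data with \(\sigma^2 = 1\) assumed known, and a simple grid search rather than gradient-based optimization:

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

# Simulated data; sigma^2 = 1 by construction, so the rescaled form applies directly
rng = np.random.default_rng(2)
N, P = 60, 5
X = rng.normal(size=(N, P))
y = X @ rng.normal(size=P) + rng.normal(size=N)

# Grid search over the ridge penalty lambda, maximizing the marginal likelihood
# N_N(y | 0, I_N + lambda^{-1} X X^T)
lambdas = np.logspace(-3, 3, 61)
log_mls = np.array([mvn(mean=np.zeros(N),
                        cov=np.eye(N) + X @ X.T / lam).logpdf(y)
                    for lam in lambdas])
lam_hat = lambdas[np.argmax(log_mls)]
```

The selected \(\hat{\lambda}\) can then be plugged into the usual ridge estimator, with no cross-validation loop required.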
Bayesian machine learning is a broad framework
This is just a fast overview of important topics
PML1 and PML2 are great references for this material
Variational Inference for Autoencoders!
One other quick bite:
The posterior is a distribution over the unknowns. If the prediction made at any feature vector \(\mathbf x_0\) is a function of these unknowns, then the prediction also follows a distribution.
Posterior predictive distribution
\[f(y_0 | \mathbf x_0 , \mathbf X , \mathbf y) = \int \limits_{\boldsymbol \theta} f(y_0 | \mathbf x_0, \boldsymbol \theta) f(\boldsymbol \theta | \mathcal D) d \boldsymbol \theta\]
For the MVN/MVN regression model, the posterior predictive distribution has a closed form:
\[f(y_0 | \mathbf x_0 , \sigma^2, \mathbf X , \mathbf y) = \mathcal N(y_0 | \hat{\boldsymbol \mu}^T \mathbf x_0 , \sigma^2 + \mathbf x_0^T \hat{\boldsymbol \Sigma} \mathbf x_0)\]
Can be used to assess the quality of the fit to the training data
See 11.7.4 in PML1 for more discussion.
Note that this is the MAP estimate of the prediction with variance equal to the error variance plus the model variance (variance over the coefficients) evaluated at the feature vector!
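A minimal sketch of computing this predictive distribution for the conjugate model (simulated data, an identity prior covariance, and a hypothetical query point \(\mathbf x_0\), all chosen for illustration):

```python
import numpy as np

# Simulated regression data (illustrative choices throughout)
rng = np.random.default_rng(3)
N, P = 40, 3
X = rng.normal(size=(N, P))
sigma2 = 0.5
y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=np.sqrt(sigma2), size=N)

# Conjugate posterior for the MVN/MVN model with prior N(0, I_P)
Sigma0_inv = np.eye(P)
Sigma_hat = np.linalg.inv(Sigma0_inv + X.T @ X / sigma2)   # posterior covariance
mu_hat = Sigma_hat @ (X.T @ y / sigma2)                    # posterior mean

# Posterior predictive at a new feature vector x0:
#   N(y0 | mu_hat^T x0, sigma^2 + x0^T Sigma_hat x0)
x0 = np.array([1.0, 0.0, -1.0])
pred_mean = mu_hat @ x0
pred_var = sigma2 + x0 @ Sigma_hat @ x0
```

Note that the predictive variance is always strictly larger than the error variance \(\sigma^2\) whenever \(\mathbf x_0 \neq \mathbf 0\), since the coefficient uncertainty term \(\mathbf x_0^T \hat{\boldsymbol \Sigma} \mathbf x_0\) is positive.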